118 research outputs found

    On the Theory of Spatial and Temporal Locality

    This paper studies the theory of caching and of temporal and spatial locality. We show the following results: (1) hashing can be used to guarantee that caches with limited associativity behave as well as a fully associative cache; (2) temporal locality cannot be characterized by one or a few parameters; (3) temporal locality and spatial locality cannot be studied separately; and (4) unlike temporal locality, spatial locality cannot be managed efficiently online.
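    The hashed-indexing idea in result (1) can be sketched in a few lines: pick the cache set by hashing the block address instead of taking its low-order bits, so strided access patterns that would collide under modulo indexing get spread across all sets. This is an illustrative sketch, not the paper's construction; all names are invented.

```python
# Sketch: a set-associative cache whose set index is a hash of the block
# address rather than the usual low-order bits (modulo indexing).
# Hashing spreads pathological strided patterns across all sets.
import hashlib

class HashedSetCache:
    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]  # each set is an LRU list

    def _index(self, block):
        # Hash the block address to pick a set (vs. block % num_sets).
        h = hashlib.sha256(block.to_bytes(8, "little")).digest()
        return int.from_bytes(h[:4], "little") % self.num_sets

    def access(self, block):
        """Return True on hit, False on miss; maintains LRU within a set."""
        s = self.sets[self._index(block)]
        if block in s:
            s.remove(block)
            s.append(block)      # move to MRU position
            return True
        if len(s) >= self.ways:  # evict the LRU block
            s.pop(0)
        s.append(block)
        return False

# A stride equal to num_sets maps every block to the same set under
# modulo indexing; hashed indexing spreads the blocks out instead.
cache = HashedSetCache(num_sets=16, ways=2)
trace = [i * 16 for i in range(8)] * 2   # strided pass, then a repeat
hits = sum(cache.access(b) for b in trace)
```

    Under plain modulo indexing this trace would thrash a single set and hit rarely; under hashed indexing the eight blocks land in (mostly) distinct sets, so the second pass hits far more often.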

    Comparisons between linear functions can help

    An example is provided of a sorting-type decision problem that can be solved in fewer steps by using comparisons between linear functions of the inputs, rather than comparisons between the inputs themselves. This disproves a conjecture of Yao [14] and Yap [16]. Several extensions are presented.

    MiniAMR - A miniapp for Adaptive Mesh Refinement

    This report describes the detailed implementation of MiniAMR, a miniapp for octree-based adaptive mesh refinement (AMR) that can be used to study the communication costs in a typical AMR simulation. We have designed new data structures and refinement/coarsening algorithms for octree-based AMR and evaluated the resulting performance improvements against similar software from Sandia National Laboratories. We also introduce the idea of amortized load balancing for AMR, and provide a low-overhead distributed load-balancing scheme for AMR applications that perform sub-cycling (refinement in time).
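    The octree bookkeeping underlying this kind of AMR can be sketched minimally: each leaf block can be refined into eight children or coarsened back when all eight children are themselves leaves. This is an illustrative sketch, not MiniAMR's actual data structures.

```python
# Minimal octree-AMR sketch: refine a leaf into 8 children, coarsen a
# node back to a leaf, and enumerate the current leaf blocks.

class OctNode:
    def __init__(self, level=0):
        self.level = level
        self.children = None  # None => this node is a leaf block

    def refine(self):
        assert self.children is None, "only leaves can be refined"
        self.children = [OctNode(self.level + 1) for _ in range(8)]

    def coarsen(self):
        # Legal only when all 8 children are leaves (no hanging refinement).
        assert self.children and all(c.children is None for c in self.children)
        self.children = None

    def leaves(self):
        if self.children is None:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

root = OctNode()
root.refine()                 # 1 block -> 8 blocks
root.children[0].refine()     # refine one child: 8 - 1 + 8 = 15 leaves
assert len(root.leaves()) == 15
root.children[0].coarsen()    # back to 8 leaves
```

    In a distributed setting, each refine/coarsen changes which blocks exist and therefore where data must live, which is why load balancing (and its amortization) dominates the communication cost.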

    Damaris: How to Efficiently Leverage Multicore Parallelism to Achieve Scalable, Jitter-free I/O

    With exascale computing on the horizon, the performance variability of I/O systems represents a key challenge in sustaining high performance. In many HPC applications, I/O is performed concurrently by all processes, which leads to I/O bursts. This causes resource contention and substantial variability of I/O performance, which significantly impacts overall application performance and, most importantly, its predictability over time. In this paper, we propose a new approach to I/O, called Damaris, which leverages dedicated I/O cores on each multicore SMP node, along with shared memory, to efficiently perform asynchronous data processing and I/O in order to hide this variability. We evaluate our approach on three different platforms, including the Kraken Cray XT5 supercomputer (ranked 11th in the Top500), with the CM1 atmospheric model, one of the target HPC applications for the Blue Waters post-petascale supercomputer project. By overlapping I/O with computation and by gathering data into large files while avoiding synchronization between cores, our solution brings several benefits: 1) it fully hides jitter as well as all I/O-related costs, which makes simulation performance predictable; 2) it increases the sustained write throughput by a factor of 15 compared to standard approaches; 3) it allows almost perfect scalability of the simulation up to over 9,000 cores, as opposed to state-of-the-art approaches which fail to scale; 4) it enables a 600% compression ratio without any additional overhead, leading to a major reduction of storage requirements.
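    The core mechanism (compute processes hand data to a dedicated I/O worker through shared memory and continue immediately, hiding write latency) can be sketched as follows. A thread stands in for the dedicated core here, and a queue for the shared-memory buffer; names are illustrative, not Damaris's API.

```python
# Sketch of the dedicated-I/O-core idea: the simulation loop hands each
# timestep's output to an asynchronous worker and never blocks on I/O.
import queue
import threading

written = []

def io_worker(buf):
    # Drain the shared buffer asynchronously; None is the shutdown signal.
    while True:
        item = buf.get()
        if item is None:
            break
        written.append(item)     # stands in for an actual file write

buf = queue.Queue()
t = threading.Thread(target=io_worker, args=(buf,))
t.start()

for step in range(4):            # the "simulation" loop
    data = [step] * 3            # stands in for one timestep's output
    buf.put(data)                # non-blocking hand-off; compute continues

buf.put(None)                    # signal shutdown and flush
t.join()
```

    Because the hand-off is non-blocking, the compute loop's timing no longer depends on file-system latency, which is what removes the jitter.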

    Comparing archival policies for Blue Waters

    This paper introduces two new tape archival policies that can improve tape archive performance in certain regimes, compared to the classical RAIT (Redundant Array of Independent Tapes) policy. The first policy, PARALLEL, still requires as many parallel tape drives as RAIT but pre-computes large data stripes that are written contiguously on tapes to increase write/read performance. The second policy, VERTICAL, writes contiguous data to a single tape, updating error-correcting information on the fly and delaying its archival until enough data has been archived. This second approach reduces the number of tape drives used per user request to one. The performance of the three policies (RAIT, PARALLEL, and VERTICAL) is assessed through extensive simulations, using a hardware configuration and a distribution of I/O requests similar to those expected on the Blue Waters system. These simulations show that VERTICAL is the most suitable policy for small files, whereas PARALLEL should be used for files larger than 1 GB. We also demonstrate that RAIT never outperforms both proposed policies, and that a heterogeneous policy mixing VERTICAL and PARALLEL performs 10 times better than any other single policy.
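    The VERTICAL idea (write each file contiguously to one tape while XOR-accumulating an error-correction stripe on the fly, flushing it later) can be sketched with plain XOR parity, which is the same redundancy principle RAIT stripes across drives. Illustrative only; the paper's actual coding scheme is not reproduced here.

```python
# Sketch: contiguous single-tape writes with a running XOR parity stripe.

def xor_blocks(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

BLOCK = 4
tapes = [bytearray() for _ in range(3)]  # data tapes
parity = bytearray(BLOCK)                # running parity stripe

files = [b"AAAA", b"BBBB", b"CCCC"]
for tape, data in zip(tapes, files):
    tape.extend(data)                       # contiguous write, one drive
    parity[:] = xor_blocks(parity, data)    # update parity on the fly

# Losing one tape is recoverable from the others plus the parity stripe:
recovered = bytes(parity)
for i, tape in enumerate(tapes):
    if i != 1:                              # reconstruct tape 1
        recovered = xor_blocks(recovered, bytes(tape))
```

    Delaying the parity flush is what lets VERTICAL serve each request with a single drive: redundancy is still maintained, but it is written out only once enough data has accumulated.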

    An Unstructured Parallel Least-Squares Spectral Element Solver for Incompressible Flow Problems

    The parallelization of the least-squares spectral element formulation of the Stokes problem has recently been discussed for incompressible flow problems on structured grids. In the present work, the extension to unstructured grids is discussed. It will be shown that, to obtain an efficient and scalable method, two different kinds of data distribution are required, involving a rather complicated parallel conversion between them. Once the data conversion has been performed, a large symmetric positive definite algebraic system has to be solved iteratively. It is well known that the Conjugate Gradient method is a good choice for such systems. To improve the convergence rate of the Conjugate Gradient process, both Jacobi and Additive Schwarz preconditioners are applied. The Additive Schwarz preconditioner is based on domain decomposition and can be implemented such that a preconditioning step corresponds to a parallel matrix-by-vector product. The new results reveal that the Additive Schwarz preconditioner is very suitable for the p-refinement version of the least-squares spectral element method. To obtain portable programs that may run on distributed-memory multiprocessors, networks of workstations, and shared-memory machines, we use MPI (Message Passing Interface). Numerical simulations have been performed to validate the scalability of the different parts of the proposed method. The experiments entailed simulating several large-scale incompressible flows on a Cray T3E and on an SGI Origin 3800, with the number of processors varying from one to more than one hundred. The results indicate that the present method has very good parallel scaling properties, making it a powerful method for numerical simulations of incompressible flows.
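    The solver stage described here, Conjugate Gradient with a diagonal (Jacobi) preconditioner on a symmetric positive definite system, can be sketched serially; a tiny pure-Python stand-in for the distributed MPI implementation, shown on a 2x2 test system.

```python
# Preconditioned Conjugate Gradient with a Jacobi (diagonal) preconditioner.
# A is a dense SPD matrix given as a list of rows; b is the right-hand side.

def pcg(A, b, tol=1e-10, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                                  # residual r = b - A*0
    minv = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner M^-1
    z = [minv[i] * r[i] for i in range(n)]    # preconditioned residual
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            break
        z = [minv[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]   # small SPD test system
b = [1.0, 2.0]
x = pcg(A, b)                   # exact solution: [1/11, 7/11]
```

    In the parallel setting, the matrix-vector product and the dot products become the communication points, and the Additive Schwarz preconditioner replaces the diagonal scaling with per-subdomain solves.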

    CCL: a portable and tunable collective communication library for scalable parallel computers

    A collective communication library for parallel computers includes frequently used operations such as broadcast, reduce, scatter, gather, concatenate, synchronize, and shift. Such a library provides users with a convenient programming interface, efficient communication operations, and the advantage of portability. A library of this nature, the Collective Communication Library (CCL), intended for the line of scalable parallel computer products by IBM, has been designed. CCL is part of the parallel application programming interface of the recently announced IBM 9076 Scalable POWERparallel System 1 (SP1). In this paper, we examine several issues related to the functionality, correctness, and performance of a portable collective communication library, focusing on three novel aspects in the design and implementation of CCL: 1) the introduction of process groups, 2) the definition of semantics that ensures correctness, and 3) the design of new and tunable algorithms based on a realistic point-to-point communication model.
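    One algorithm family such a library tunes is the binomial-tree broadcast, which finishes in ceil(log2 p) rounds by doubling the set of processes holding the data each round. The sketch below simulates the message flow only; the function name is illustrative, not CCL's API.

```python
# Simulate the message schedule of a binomial-tree broadcast from rank 0.
import math

def binomial_broadcast(p, root=0):
    """Return the list of per-round (src, dst) messages for p processes."""
    has_data = {root}
    rounds = []
    for k in range(math.ceil(math.log2(p))):
        msgs = []
        for src in sorted(has_data):
            dst = src ^ (1 << k)           # partner at distance 2**k
            if dst < p and dst not in has_data:
                msgs.append((src, dst))
        for _, dst in msgs:                # new holders send in later rounds
            has_data.add(dst)
        rounds.append(msgs)
    assert has_data == set(range(p))       # everyone received the data
    return rounds

rounds = binomial_broadcast(8)
# round 0: 0->1; round 1: 0->2, 1->3; round 2: 0->4, 1->5, 2->6, 3->7
```

    A tunable library picks between such schedules (binomial trees, pipelines, scatter-plus-allgather) based on message size and the machine's point-to-point cost model.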

    Scheduling the I/O of HPC Applications Under Congestion

    A significant percentage of the computing capacity of large-scale platforms is wasted because of interference incurred by multiple applications that access a shared parallel file system concurrently. One solution to handling I/O bursts in large-scale HPC systems is to absorb them at an intermediate storage layer consisting of burst buffers. However, our analysis of Argonne's Mira system shows that burst buffers cannot prevent congestion at all times. Consequently, I/O performance is dramatically degraded, showing in some cases a decrease in I/O throughput of 67%. In this paper, we analyze the effects of interference on application I/O bandwidth and propose several scheduling techniques to mitigate congestion. We show through extensive experiments that our global I/O scheduler is able to reduce the effects of congestion, even on systems where burst buffers are used, and can increase overall system throughput by up to 56%. We also show that it outperforms current Mira I/O schedulers.
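    The benefit of ordering I/O phases instead of letting them contend can be shown with back-of-the-envelope arithmetic. The numbers below are illustrative, not measurements from the paper.

```python
# Toy model: two applications each write V GB through a file system with
# bandwidth B GB/s. Concurrent access splits the bandwidth; a global
# scheduler that serializes the two I/O phases grants each full bandwidth.

B = 100.0          # file-system bandwidth, GB/s
V = 200.0          # volume each application writes, GB

# Concurrent (congested) access: each app sustains B/2 the whole time.
t_shared = V / (B / 2)           # both finish at t = 4.0 s

# Scheduled (exclusive) access: app 1 writes first, app 2 second.
t_first = V / B                  # 2.0 s
t_second = t_first + V / B       # 4.0 s

# The makespan is the same, but mean completion time improves, which is
# one way serializing I/O phases raises effective system throughput.
mean_shared = t_shared                     # 4.0 s for both apps
mean_scheduled = (t_first + t_second) / 2  # 3.0 s
```

    Real systems add interference overheads beyond simple bandwidth splitting, so the measured gains (such as the 56% throughput increase reported above) can exceed what this idealized model predicts.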
